Learning from Little: Comparison of Classifiers Given Little Training
Authors
Abstract
Many real-world machine learning tasks are faced with the problem of small training sets. Additionally, the class distribution of the training set often does not match the target distribution. In this paper we compare the performance of many learning models on a substantial benchmark of binary text classification tasks having small training sets. We vary the training size and class distribution to examine the learning surface, as opposed to the traditional learning curve. The models tested include various feature selection methods, each coupled with four learning algorithms: Support Vector Machines (SVM), Logistic Regression, Naive Bayes, and Multinomial Naive Bayes. Different models excel in different regions of the learning surface, leading to meta-knowledge about which to apply in different situations. This helps guide the researcher and practitioner when facing choices of model and feature selection method in, for example, information retrieval and other settings.

1 Motivation & Scope

Our goal is to advance the state of meta-knowledge about selecting which learning models to apply in which situations. Consider these four motivations:

1. Information Retrieval: Suppose you are building an advanced search interface. As the user sifts through each page of ten search results, the system trains a classifier on the fly to provide a ranking of the remaining results based on the user's positive or negative indication on each result shown thus far. Which learning model should you implement to provide the greatest precision under the conditions of little training data and a markedly skewed class distribution?

2. Semi-Supervised Learning: When learning from small training sets, it is natural to try to leverage the many unlabeled examples. A common first phase in such algorithms is to train an initial classifier with the little data available and apply it to select additional predicted-positive and predicted-negative examples from the unlabeled data to augment the training set before learning the final classifier (e.g. [1]). With a poor choice for the initial learning model, the augmented examples will pollute the training set. Which learning model is most appropriate for the initial classifier?

3. Real-World Training Sets: In many real-world projects the training set must be built up from scratch over time. The period where there are only a few training examples is especially long if there are many classes, e.g. 30–500. Ideally, one would like to be able to train the most effective classifiers at any point. Which methods are most effective with little training?

4. Meta-Knowledge: Testing all learning models on each new classification task at hand is an agnostic and inefficient route to building high-quality classifiers. The research literature must continue to strive to give guidance to the practitioner as to which (few) models are most appropriate in which situations.

Table 1. Summary of test conditions we vary.

  Training conditions:
    P  = 1..40      Positives in training set
    N  = 1..200     Negatives in training set
    FX = 10..1000   Features selected
  Feature selection metrics:
    IG      Information Gain
    BNS     Bi-Normal Separation
  Learning algorithms:
    NB      Naive Bayes
    Multi   Multinomial Naive Bayes
    Log     Logistic Regression
    SVM     Support Vector Machine
  Performance metrics:
    TP10        True positives in top 10
    TN100       True negatives in bottom 100
    F-measure   2 × precision × recall ÷ (precision + recall)
                (harmonic avg. of precision & recall)
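To make the two feature selection metrics of Table 1 concrete, the Python sketch below scores a single word feature from its document counts in the positive and negative classes. The Bi-Normal Separation form |F⁻¹(tpr) − F⁻¹(fpr)|, with F⁻¹ the inverse standard Normal CDF, follows Forman's earlier feature-selection work; the clipping constant and the function names are illustrative assumptions, not the implementation used in this paper.

```python
# Minimal sketch of the two feature-selection metrics named in Table 1,
# scored per word feature from positive/negative document counts.
import numpy as np
from scipy.stats import norm


def bns_score(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation for one feature.

    tp  = positive documents containing the feature
    fp  = negative documents containing the feature
    pos = total positive documents, neg = total negative documents
    eps is an illustrative clip to keep the inverse Normal CDF finite.
    """
    tpr = np.clip(tp / pos, eps, 1 - eps)
    fpr = np.clip(fp / neg, eps, 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))


def ig_score(tp, fp, pos, neg):
    """Information Gain: entropy of the class minus entropy conditioned
    on whether the feature is present in a document."""
    n = pos + neg

    def entropy(counts):
        p = np.array(counts, dtype=float)
        p = p[p > 0]
        if p.sum() == 0:
            return 0.0
        p = p / p.sum()
        return -(p * np.log2(p)).sum()

    h_class = entropy([pos, neg])
    present, absent = tp + fp, n - tp - fp
    h_cond = (present / n) * entropy([tp, fp]) \
           + (absent / n) * entropy([pos - tp, neg - fp])
    return h_class - h_cond
```

In either case, features are ranked by the score and the top FX = 10..1000 are kept.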
Furthermore, in the common situation where there is a shortage of training data, cross-validation for model selection can be inappropriate and will likely lead to over-fitting. Instead, one may follow the a priori guidance of studies demonstrating that some learning models are superior to others over large benchmarks. In order to provide such guidance, we compare the performance of many learning models (4 induction algorithms × feature selection variants) on a benchmark of hundreds of binary text classification tasks drawn from various benchmark databases, e.g., Reuters, TREC, and OHSUMED. To suit the real-world situations we have encountered in industrial practice, we focus on tasks with small training sets and a small proportion of positives in the test distribution.

Note that in many situations, especially information retrieval or fault detection, the ratio of positives and negatives provided in the training set is unlikely to match the target distribution. And so, rather than explore a learning curve with matching distributions, we explore the entire learning surface, varying the number of positives and negatives in the training set independently of each other (from 1 to 40 positives and 1 to 200 negatives). This contrasts with most machine learning research, which tests under conditions of (stratified) cross-validation or random test/train splits, preserving the distribution.

The learning models we evaluate are the cross product of four popular learning algorithms (Support Vector Machines, Logistic Regression, Naive Bayes, Multinomial Naive Bayes), two highly successful feature selection metrics (Information Gain, Bi-Normal Separation), and seven settings for the number of top-ranked features to select, varying from 10 to 1000.

We examine the results from several perspectives: precision in the top-ranked items, precision for the negative class in the bottom-ranked items, and F-measure, each being appropriate for different situations. For each perspective, we determine which models consistently perform well under varying amounts of training data. For example, Multinomial Naive Bayes coupled with feature selection via Bi-Normal Separation can be closely competitive to SVMs for precision, performing significantly better when there is a scarcity of positive training examples.

The rest of the paper is organized as follows. The remainder of this section puts this study in context with related work. Section 2 details the experiment protocol. Section 3 gives highlights of the results with discussion. Section 4 concludes with implications and future directions.
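As a rough illustration of this protocol, the sketch below evaluates one cell of the learning surface: given a training sample with P positives and N negatives, it keeps the top-k features under a chosen metric, trains each of the four learners, and computes TP10, TN100, and F-measure on a ranked test set. The scikit-learn classes and the helper name evaluate_cell are assumptions for illustration, not the authors' implementation.

```python
# One (P, N, k) cell of the learning surface: train each learner on the
# selected features and score the ranking-oriented metrics of Table 1.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.metrics import f1_score

LEARNERS = {
    "SVM": LinearSVC(),
    "Log": LogisticRegression(max_iter=1000),
    "NB": BernoulliNB(),
    "Multi": MultinomialNB(),
}

def evaluate_cell(X_train, y_train, X_test, y_test, feature_scores, k=100):
    """Train every learner on one (P, N) training sample and report metrics."""
    top_k = np.argsort(feature_scores)[::-1][:k]       # top-k ranked features
    results = {}
    for name, clf in LEARNERS.items():
        clf.fit(X_train[:, top_k], y_train)
        # Rank test documents by classifier confidence for the positive class.
        if hasattr(clf, "decision_function"):
            scores = clf.decision_function(X_test[:, top_k])
        else:
            scores = clf.predict_proba(X_test[:, top_k])[:, 1]
        order = np.argsort(scores)[::-1]
        tp10 = int(y_test[order[:10]].sum())            # positives in top 10
        tn100 = int((y_test[order[-100:]] == 0).sum())  # negatives in bottom 100
        f1 = f1_score(y_test, clf.predict(X_test[:, top_k]))
        results[name] = {"TP10": tp10, "TN100": tn100, "F": f1}
    return results
```

Sweeping this over P = 1..40, N = 1..200, both feature selection metrics, and the seven values of k yields the learning surfaces compared in the results.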
Similar articles
Impact of Training Set Size on Object-Based Land Cover Classification: A Comparison of Three Classifiers
Supervised classifiers are commonly employed in remote sensing to extract land cover information, but various factors affect their accuracy. The number of available training samples, in particular, is known to have a significant impact on classification accuracies. Obtaining a sufficient number of samples is, however, not always practical. The support vector machine (SVM) is a supervised classi...
An Investigation of the Relationship between L2 Learning Styles and Teaching Methodologies in EFL Classes
Individual differences have always been a key element in the success and failure of learners in language classrooms. Learners come to EFL classes with various learning styles and teachers utilize different methodologies targeting different needs of the learners which may have important effects on the quality of the learning environment. In this study a comparison is made between learning styles...
Applying Support Vector Machines to the TREC-2001 Batch Filtering and Routing Tasks
Given the large training set available for batch filtering, choosing a supervised learning algorithm that would make effective use of this data was critical. The support vector machine approach (SVM) to training linear classifiers has outperformed competing approaches in a number of recent text categorization studies, particularly for categories with substantial numbers of positive training exa...
IRDDS: Instance Reduction based on Distance-based Decision Surface
In instance-based learning, a training set is given to a classifier for classifying new instances. In practice, not all information in the training set is useful for classifiers. Therefore, it is convenient to discard irrelevant instances from the training set. This process is known as instance reduction, which is an important task for classifiers since through this process the time for classif...
Detecting Cognitive States Using Machine Learning
Very little is known about the relationship between the cognitive states and the fMRI data, and very little is known about the feasibility of training classifiers to decode cognitive states. Our efforts aimed to automatically discover which spatial-temporal patterns in the fMRI data indicate a subject is performing a specific cognitive task, such as watching a picture or sentence. We developed ...